1 Setting Up

Learning objective

  1. You can access R and RStudio, either through RStudio Cloud (rstudio.cloud) or by downloading and installing the software on your computer.

1.1 Introduction

To start you off on your R journey, we’ll need to set you up with the required software, R and RStudio. R is the programming language that you’ll use to write code, while RStudio is an integrated development environment (IDE) that makes working with R easier.

1.2 Working locally vs. on the cloud

There are two main ways that you can access and work with R and RStudio: download them to your computer, or use a web server to access them on the cloud.

Using R and RStudio on the cloud is the less common option, but it may be the right choice if you are just getting started with programming, and you do not yet want to worry about installing software. You may also prefer the cloud option if your local computer is old, slow, or otherwise unfit for running R.

Below, we go through the setup process for RStudio Cloud, RStudio on Windows, and RStudio on macOS separately. Jump to the section that is relevant for you!

RStudio Cloud will only give you 25 free project hours per month. After that, you will need to upgrade to a paid plan. If you think you’ll need more than 25 hours per month, you may want to avoid this option.

1.3 RStudio on the cloud

If you’ll be working on the cloud, follow the steps below:

  1. Go to the website rstudio.cloud and follow the instructions to sign up for a free account. (We recommend signing up with Google if you have a Google account, so you don’t need to remember any new passwords).

  2. Once you’re done, click on the “New Project” icon at the top right, and select “New RStudio Project”.

You should see a screen like this:

This is RStudio, your new home for a long time to come!

At the top of the screen, rename the project from “Untitled Project” to something like “r_intro”.

You can start using R by typing code into the “console” pane on the left:

Try using R as a calculator here; type 2 + 2 and press Enter.
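If you would like to go a little further than 2 + 2, here is a minimal sketch of other lines you could type into the console. The numbers and the name `total` are arbitrary choices for illustration and are not part of the lesson. The same lines should work whichever setup option you choose.

```r
# Basic arithmetic typed at the console prompt
2 + 2    # addition
10 - 3   # subtraction
6 * 7    # multiplication
9 / 2    # division
2^3      # exponentiation

# Store a result under a name (the name `total` is arbitrary) and reuse it
total <- 2 + 2
total * 10
```

Press Enter after each line; R prints the result directly below the line you typed.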

That’s it; you’re ready to roll. Whenever you want to reopen RStudio, navigate to rstudio.cloud.

Proceed to the “wrapping up” section of the lesson.

1.4 Set up on Windows

1.4.1 Download and install R

If you’re working on Windows, follow the steps below to download and install R:

  1. Go to cran.rstudio.com to access the R installation page. Then click the download link for Windows:

  2. Choose the “base” sub-directory.

  3. Then click on the download link at the top of the page to download the latest version of R:

    Note that the screenshot above may not show the latest version.

  4. After the download is finished, click on the downloaded file, then follow the instructions on the installation pop-up window. During installation, you should not have to change any of the defaults; just keep clicking “Next” until the installation is done.

    Well done! You should now have R on your computer. But you likely won’t ever need to interact with R directly. Instead you’ll use the RStudio IDE to work with R. Follow the instructions in the next section to get RStudio.

1.5 Download, install & run RStudio

To download RStudio, go to rstudio.com/products/rstudio/download/#download and download the Windows version.

After the download is finished, click on the downloaded file and follow the installation instructions.

Once installed, RStudio can be opened like any application on your computer: press the Windows key to bring up the Start menu, and search for “rstudio”. Click to open the app:

You should see a window like this:

This is RStudio, your new home for a long time to come!

You can start using R by typing code into the “console” pane on the left:

Try using R as a calculator here; type 2 + 2 and press Enter.

That’s it; you’re ready to roll. Proceed to the “wrapping up” section of the lesson.

1.6 Set up on macOS

1.6.1 Download and install R

If you’re working on macOS, follow the steps below to download and install R:

  1. Go to cran.rstudio.com to access the R installation page. Then click the link for macOS:

  2. Download and install the relevant R version for your Mac. For most people, the first option under “Latest release” will be the one to get.

  3. After the download is finished, click on the downloaded file, then follow the instructions on the installation pop-up window.

Well done! You should now have R on your computer. But you likely won’t ever need to interact with R directly. Instead you’ll use the RStudio IDE to work with R. Follow the instructions in the next section to get RStudio.

1.6.2 Download, install & run RStudio

To download RStudio, go to rstudio.com/products/rstudio/download/#download and download the version for macOS.

After the download is finished, click on the downloaded file and follow the installation instructions.

Once installed, RStudio can be opened like any application on your computer: Press Command + Space to open Spotlight, then search for “rstudio”. Click to open the app.

You should see a window like this:

This is RStudio, your new home for a long time to come!

You can start using R by typing code into the “console” pane on the left:

Try using R as a calculator here; type 2 + 2 and press Enter.

1.7 Wrap up

You should now have access to R and RStudio, so you’re all set to begin the journey of learning to use these immensely powerful tools. See you in the next session!

Contributors

The following team members contributed to this lesson:

References

Some material in this lesson was adapted from the following sources:

This work is licensed under the Creative Commons Attribution Share Alike license.

2 Data visualization fundamentals

2.1 Learning objectives

  1. You can appreciate the value of a good plot and get excited about data visualization!
  2. You can interpret a plot and extract relevant information about the data.
  3. You can name the R package we will use for data visualization in the course.

2.2 Introduction

Welcome to a new chapter of Introductory Data Analysis in R. Now that you have some familiarity with writing R code and using RStudio, it’s time to get some hands-on practice with data analysis. We begin the development of your data analysis toolbox with data visualization. Visualizing our research in intriguing and comprehensible ways is essential in sharing it with peers, stakeholders, decision makers, and the public.

Our main goal with this chapter is to introduce you to both the theory and the methods of data visualization in a sensible, understandable, and reproducible way.

We will be using the package ggplot2 in R to produce high-quality visualizations. ggplot2 is one of the core packages of the tidyverse. It is centred on the philosophy of the grammar of graphics (Wilkinson 2005), which will feature heavily in this course. In addition to the mechanics of writing R code for visualization, this chapter will also teach you how to use visualization to tell a story with your data.

2.3 Why visualization?

You may have heard the saying “a picture is worth a thousand words”. Examining a figure can be more time-efficient than reading, and figures can more easily point out details and connections and be more engaging, convincing, and inspiring than text.

Not only that, but without visualization we may miss unusual patterns, distributions, outliers, missing values, gaps, clustering, etc. Graphics and data visualization raise questions about the data, which leads to more exploration and research.

By visualizing data, we gain valuable insights we couldn’t initially obtain from just looking at the raw data.

2.4 The power of data visualization in action

Let’s analyse some real-world epidemiological data and demonstrate how a plot is worth a thousand data points. For this lesson, we will be analysing weekly reported measles cases at the regional level in Niger from 1995 to 2005. Niger is divided into 8 administrative regions:

As part of a 2020 study of measles dynamics in Niger, Blake et al. published this dataset:

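library(here)  # provides here() for building project-relative file paths
# Load the saved dataset into the session, then print the nigerm data frame
load(here("ch03_intro_to_data_viz/data/clean/nigerm_cases_rgn.RData"))
nigerm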

From merely looking at the data frame we can learn that it contains 4576 observations of 4 variables. Each row tells us how many cases were reported in a particular region in a particular week of a particular year. We can also look at the Environment pane to find out the data class of each variable: year, week and cases are integers, and region is a factor with 8 levels.
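The same facts can also be checked from the console rather than the Environment pane. The sketch below uses a small made-up data frame, hypothetically named `toy_cases`, that borrows the column names year, week, region and cases; its values are invented and are not the Niger data. In practice you would pass nigerm to the same functions.

```r
# A small made-up data frame with the same column names as nigerm.
# The values are invented for illustration; they are NOT the Niger data.
toy_cases <- data.frame(
  year   = c(1995L, 1995L, 1996L, 1996L),
  week   = c(1L, 2L, 1L, 2L),
  region = factor(c("Agadez", "Diffa", "Agadez", "Diffa")),
  cases  = c(0L, 12L, 3L, 40L)
)

dim(toy_cases)             # number of rows, then number of columns
sapply(toy_cases, class)   # data class of each variable
nlevels(toy_cases$region)  # number of factor levels in region
str(toy_cases)             # compact overview of the whole structure
```

`str()` is often the quickest of these, because it reports the dimensions, the class of each column, and the first few values in one go.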

We may be able to get a vague idea of epidemic patterns by browsing through the dataset manually, but this is extremely tedious and time-consuming.

We can get a little more information by inspecting the summary statistics given by the summary() function, which was introduced in the previous chapter:

nigerm %>% summary()
##       year           week           region         cases       
##  Min.   :1995   Min.   : 1.00   Agadez : 572   Min.   :   0.0  
##  1st Qu.:1997   1st Qu.:13.75   Diffa  : 572   1st Qu.:   1.0  
##  Median :2000   Median :26.50   Dosso  : 572   Median :  16.0  
##  Mean   :2000   Mean   :26.50   Maradi : 572   Mean   : 100.3  
##  3rd Qu.:2003   3rd Qu.:39.25   Niamey : 572   3rd Qu.:  86.0  
##  Max.   :2005   Max.   :52.00   Tahoua : 572   Max.   :1887.0  
##                                 (Other):1144

This gives us the minimum, maximum, mean, and quartiles of each numeric variable, and the number of observations (rows) for each region. The most notable finding is that the distribution of case counts is very heavily skewed: the median (16) is far below the mean (100.3), and the maximum is 1887. It is not a normal distribution, and there may be a lot of zeros and a few extremely high values.
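To see why a mean far above the median points to this kind of skew, here is a minimal sketch on an invented vector, hypothetically named `toy_counts`. The numbers are made up to mimic many quiet weeks and one large outbreak week; they are not taken from nigerm.

```r
# Invented weekly counts: many small values and one large outbreak week.
# These numbers are for illustration only; they are NOT the Niger data.
toy_counts <- c(0, 0, 0, 1, 2, 3, 5, 8, 20, 400)

mean(toy_counts)       # pulled upward by the single large value
median(toy_counts)     # middle value, barely affected by the extreme
mean(toy_counts == 0)  # proportion of weeks with zero cases
quantile(toy_counts, probs = c(0.25, 0.5, 0.75))  # the three quartiles
```

The single value of 400 drags the mean well above the median, which is the same signature that the cases column shows in the summary output.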

However, this summary omits a large amount of information contained in the dataset. It doesn’t allow us to compare and contrast case numbers between regions or years, and the epidemic trends we are interested in are not apparent.

The easiest and clearest way to extract patterns from this dataset is to visualize it! As we will see, plots help us to identify patterns and outliers in our data, and compare the distribution of a numerical variable (e.g., cases) as we go across the levels of a different categorical variable (e.g., region).
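As a small illustration of comparing a numerical variable across the levels of a categorical one, the sketch below draws one box per region. The data frame `toy_cases` and its values are invented for this example, and the choice of a boxplot (`geom_boxplot()`) is ours; it is only one of several plot types that could serve this purpose.

```r
library(ggplot2)

# Invented counts for two regions; these are NOT the Niger data.
toy_cases <- data.frame(
  region = factor(rep(c("Agadez", "Diffa"), each = 5)),
  cases  = c(0, 2, 5, 9, 60, 1, 1, 3, 4, 12)
)

# One box per region shows how the distribution of cases differs by region
toy_plot <- ggplot(data = toy_cases,
                   mapping = aes(x = region, y = cases)) +
  geom_boxplot()

toy_plot
```

Points drawn beyond the whiskers are the unusually high values, so outliers stand out at a glance.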

We can use the ggplot() function from the ggplot2 package to produce a series of line graphs that visualize all the data points in the nigerm dataset.

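library(ggplot2)
# Map week to the x-axis and cases to the y-axis, with one coloured line per region
ggplot(data = nigerm, 
       mapping = aes(x = week, y = cases, group = region, colour = region)) +
  geom_line() +            # draw the weekly case counts as lines
  facet_wrap(vars(year))   # draw one panel per year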

With just a few lines of code, we are able to get a much clearer idea of disease trends over time. For example, we can see that the measles outbreaks rise and fall in classic bell-shaped epidemic curves, with most peaks occurring in the second quarter of each calendar year. This indicates that there may be seasonal patterns driving measles transmission. Plotting the data also allows us to compare the relative magnitude of the outbreaks between regions and between years.
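A visual impression such as "peaks fall at a similar time each year" can also be checked numerically. The sketch below is one possible way to do that in base R, using an invented data frame hypothetically named `toy_weekly`; the values are made up and are not the Niger data.

```r
# Invented weekly totals for two years; these are NOT the Niger data.
toy_weekly <- data.frame(
  year  = rep(c(1995L, 1996L), each = 4),
  week  = rep(c(5L, 15L, 25L, 35L), times = 2),
  cases = c(10, 80, 30, 5, 20, 60, 90, 15)
)

# Split the rows by year, then keep the row with the most cases in each year
by_year <- split(toy_weekly, toy_weekly$year)
peaks <- do.call(rbind, lapply(by_year, function(d) d[which.max(d$cases), ]))

peaks  # one row per year: the week in which cases peaked
```

With the real data, the week column of such a table would show whether the peaks cluster in the same part of the calendar year.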

In this chapter, you will learn how to harness the power of visualization in R to explore data and to present data to others. In addition to learning the mechanics of writing code for visualizations, you will also learn how to use the art of visualization to tell a story with your data.
